Welcome to the follow-up to the “Intro to R”! If you see this you have managed to download the files from GitHub, which already makes you smarter than tolly500:

This means now it’s…

Setup time

If you haven’t already:

  1. Install R:
  1. Install RStudio from the official website (please)

  2. Install packages:

  1. Make yourself feel at home: If you’re done early, under Tools > Global Options > Appearance you can change the theme etc

The tidyverse and “tidy” data

In this workshop, we will use the tidyverse, which is a collection of R packages designed to make data manipulation and analysis as easy and coherent as possible. The tools are based around the concept of “tidy data”:

Artworks by @allison_horst

Today you will learn about the tools you can use with tidy data (including visualizations), and tomorrow you will learn how to (1) keep your projects tidy as well, and (2) turn messy into tidy data.

For our examples, we will use data from Gapminder for our examples, which is provided as an R-package:

library(gapminder)
data("gapminder")

dplyr-verbs

library(dplyr)

dplyr is a “grammar” for data manipulation & the core of the tidyverse. It follows the logic of “verbs”:

Click here to get to the dplyr-documentation, which is really comprehensive with many examples!

Selecting:

If you look at ?select, you will see that it has two arguments:

  1. .data, which is the data frame from which to select variables, and
  2. ..., which variables to select.

For example:

select(.data = gapminder, country, year, lifeExp)
#> # A tibble: 1,704 × 3
#>    country      year lifeExp
#>    <fct>       <int>   <dbl>
#>  1 Afghanistan  1952    28.8
#>  2 Afghanistan  1957    30.3
#>  3 Afghanistan  1962    32.0
#>  4 Afghanistan  1967    34.0
#>  5 Afghanistan  1972    36.1
#>  6 Afghanistan  1977    38.4
#>  7 Afghanistan  1982    39.9
#>  8 Afghanistan  1987    40.8
#>  9 Afghanistan  1992    41.7
#> 10 Afghanistan  1997    41.8
#> # ℹ 1,694 more rows

Another way of writing the same thing is using the “pipe” operator (|> or %>%)1 : it takes what is on its left-hand side and hands it to the function on the right-hand side2. If this sounds complicated, just read the pipe in your head as “and then…”:

gapminder |> select(country, year, lifeExp)
#> # A tibble: 1,704 × 3
#>    country      year lifeExp
#>    <fct>       <int>   <dbl>
#>  1 Afghanistan  1952    28.8
#>  2 Afghanistan  1957    30.3
#>  3 Afghanistan  1962    32.0
#>  4 Afghanistan  1967    34.0
#>  5 Afghanistan  1972    36.1
#>  6 Afghanistan  1977    38.4
#>  7 Afghanistan  1982    39.9
#>  8 Afghanistan  1987    40.8
#>  9 Afghanistan  1992    41.7
#> 10 Afghanistan  1997    41.8
#> # ℹ 1,694 more rows

Take gapminder, and then select the variables country, year and lifeExp. Some other useful tricks include:

# Dropping columns (negative selection):
gapminder |> select(-continent)
gapminder |> select(-c(continent, pop))

# Selecting columns by name patterns:
gapminder |> select(starts_with("c"))
gapminder |> select(ends_with("p"))

# Select from:to
gapminder |> select(country:lifeExp)

You can also select everything(), which selects… well, everything, kind of. You can use it e.g. to reorder columns (here, it will select those explicitly called upon first, and then everything() else):

gapminder |> select(pop, gdpPercap, everything())
#> # A tibble: 1,704 × 6
#>         pop gdpPercap country     continent  year lifeExp
#>       <int>     <dbl> <fct>       <fct>     <int>   <dbl>
#>  1  8425333      779. Afghanistan Asia       1952    28.8
#>  2  9240934      821. Afghanistan Asia       1957    30.3
#>  3 10267083      853. Afghanistan Asia       1962    32.0
#>  4 11537966      836. Afghanistan Asia       1967    34.0
#>  5 13079460      740. Afghanistan Asia       1972    36.1
#>  6 14880372      786. Afghanistan Asia       1977    38.4
#>  7 12881816      978. Afghanistan Asia       1982    39.9
#>  8 13867957      852. Afghanistan Asia       1987    40.8
#>  9 16317921      649. Afghanistan Asia       1992    41.7
#> 10 22227415      635. Afghanistan Asia       1997    41.8
#> # ℹ 1,694 more rows

Filtering: Subset your data, keeping rows that match a condition. Useful logics:

gapminder |> filter(continent == "Asia")
#> # A tibble: 396 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ℹ 386 more rows
gapminder |> filter(year != 2007)
#> # A tibble: 1,562 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ℹ 1,552 more rows

gapminder |> 
  filter(year == 2007 & continent %in% c("Asia", "Europe") & pop > 1e7)
#> # A tibble: 37 × 6
#>    country        continent  year lifeExp        pop gdpPercap
#>    <fct>          <fct>     <int>   <dbl>      <int>     <dbl>
#>  1 Afghanistan    Asia       2007    43.8   31889923      975.
#>  2 Bangladesh     Asia       2007    64.1  150448339     1391.
#>  3 Belgium        Europe     2007    79.4   10392226    33693.
#>  4 Cambodia       Asia       2007    59.7   14131858     1714.
#>  5 China          Asia       2007    73.0 1318683096     4959.
#>  6 Czech Republic Europe     2007    76.5   10228744    22833.
#>  7 France         Europe     2007    80.7   61083916    30470.
#>  8 Germany        Europe     2007    79.4   82400996    32170.
#>  9 Greece         Europe     2007    79.5   10706290    27538.
#> 10 India          Asia       2007    64.7 1110396331     2452.
#> # ℹ 27 more rows

Now that we know two functions, we can see why piping is one of the most powerful ideas in R. Say we want country names, life expectancy & GDP per capita for all European countries in 2007. We can combine a filter() and a select()ion into a pipeline:

gapminder |> 
  filter(continent == "Europe" & year == 2007) |> 
  select(country, lifeExp, gdpPercap)
#> # A tibble: 30 × 3
#>    country                lifeExp gdpPercap
#>    <fct>                    <dbl>     <dbl>
#>  1 Albania                   76.4     5937.
#>  2 Austria                   79.8    36126.
#>  3 Belgium                   79.4    33693.
#>  4 Bosnia and Herzegovina    74.9     7446.
#>  5 Bulgaria                  73.0    10681.
#>  6 Croatia                   75.7    14619.
#>  7 Czech Republic            76.5    22833.
#>  8 Denmark                   78.3    35278.
#>  9 Finland                   79.3    33207.
#> 10 France                    80.7    30470.
#> # ℹ 20 more rows

Take gapminder, and then filter for continent == "Europe and year == 2007, and then select the variables country, lifeExp and gdpPercap. If we want to keep this, we can then assign the result of our pipeline:

europe <- 
  gapminder |> 
  filter(continent == "Europe" & year == 2007) |> 
  select(country, lifeExp, gdpPercap)

Since R supports right-hand assignment, we can write this even more intuitively:

gapminder |> 
  filter(continent == "Europe" & year == 2007) |> 
  select(country, lifeExp, gdpPercap) -> europe

Be warned though that many people consider right-hand assignment “bad practice” and more of a gimmick (for example it’s hard to see that there is an assignment operation happening here if you just skim the code).

Arranging:

gapminder |> arrange(lifeExp)
#> # A tibble: 1,704 × 6
#>    country      continent  year lifeExp     pop gdpPercap
#>    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
#>  1 Rwanda       Africa     1992    23.6 7290203      737.
#>  2 Afghanistan  Asia       1952    28.8 8425333      779.
#>  3 Gambia       Africa     1952    30    284320      485.
#>  4 Angola       Africa     1952    30.0 4232095     3521.
#>  5 Sierra Leone Africa     1952    30.3 2143249      880.
#>  6 Afghanistan  Asia       1957    30.3 9240934      821.
#>  7 Cambodia     Asia       1977    31.2 6978607      525.
#>  8 Mozambique   Africa     1952    31.3 6446316      469.
#>  9 Sierra Leone Africa     1957    31.6 2295678     1004.
#> 10 Burkina Faso Africa     1952    32.0 4469979      543.
#> # ℹ 1,694 more rows
gapminder |> arrange(desc(pop))
#> # A tibble: 1,704 × 6
#>    country continent  year lifeExp        pop gdpPercap
#>    <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
#>  1 China   Asia       2007    73.0 1318683096     4959.
#>  2 China   Asia       2002    72.0 1280400000     3119.
#>  3 China   Asia       1997    70.4 1230075000     2289.
#>  4 China   Asia       1992    68.7 1164970000     1656.
#>  5 India   Asia       2007    64.7 1110396331     2452.
#>  6 China   Asia       1987    67.3 1084035000     1379.
#>  7 India   Asia       2002    62.9 1034172547     1747.
#>  8 China   Asia       1982    65.5 1000281000      962.
#>  9 India   Asia       1997    61.8  959000000     1459.
#> 10 China   Asia       1977    64.0  943455000      741.
#> # ℹ 1,694 more rows

Again, combining what we know to find the countries with the highest life expectancy:

gapminder |> 
  filter(continent == "Europe" & year == 2007) |> 
  select(country, lifeExp, gdpPercap) |> 
  arrange(desc(lifeExp))
#> # A tibble: 30 × 3
#>    country     lifeExp gdpPercap
#>    <fct>         <dbl>     <dbl>
#>  1 Iceland        81.8    36181.
#>  2 Switzerland    81.7    37506.
#>  3 Spain          80.9    28821.
#>  4 Sweden         80.9    33860.
#>  5 France         80.7    30470.
#>  6 Italy          80.5    28570.
#>  7 Norway         80.2    49357.
#>  8 Austria        79.8    36126.
#>  9 Netherlands    79.8    36798.
#> 10 Greece         79.5    27538.
#> # ℹ 20 more rows

Think for a moment: how do you read this sequence in your head?

Renaming:

gapminder |> 
  select(country, year, lifeExp, gdpPercap) |> 
  rename("life_exp" = lifeExp, "gdp_pc" = gdpPercap)
#> # A tibble: 1,704 × 4
#>    country      year life_exp gdp_pc
#>    <fct>       <int>    <dbl>  <dbl>
#>  1 Afghanistan  1952     28.8   779.
#>  2 Afghanistan  1957     30.3   821.
#>  3 Afghanistan  1962     32.0   853.
#>  4 Afghanistan  1967     34.0   836.
#>  5 Afghanistan  1972     36.1   740.
#>  6 Afghanistan  1977     38.4   786.
#>  7 Afghanistan  1982     39.9   978.
#>  8 Afghanistan  1987     40.8   852.
#>  9 Afghanistan  1992     41.7   649.
#> 10 Afghanistan  1997     41.8   635.
#> # ℹ 1,694 more rows

Most of this is self-explanatory right? Here is a tip: you can rename inside select(), so you rarely need rename():

gapminder |> 
  select(country, year, "life_exp" = lifeExp, "gdp_pc" = gdpPercap)
#> # A tibble: 1,704 × 4
#>    country      year life_exp gdp_pc
#>    <fct>       <int>    <dbl>  <dbl>
#>  1 Afghanistan  1952     28.8   779.
#>  2 Afghanistan  1957     30.3   821.
#>  3 Afghanistan  1962     32.0   853.
#>  4 Afghanistan  1967     34.0   836.
#>  5 Afghanistan  1972     36.1   740.
#>  6 Afghanistan  1977     38.4   786.
#>  7 Afghanistan  1982     39.9   978.
#>  8 Afghanistan  1987     40.8   852.
#>  9 Afghanistan  1992     41.7   649.
#> 10 Afghanistan  1997     41.8   635.
#> # ℹ 1,694 more rows

Summarising: Compute summary values

gapminder |> 
  filter(year == 2007) |> 
  summarise(mean_life_exp = mean(lifeExp, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   mean_life_exp
#>           <dbl>
#> 1          67.0

You may ask yourself: why not just use mean at this point3? summarise only really shines when combined with group_by() to compute grouped summary statistics or aggregates:

gapminder |> 
  filter(year == 2007) |>
  group_by(continent) |> 
  summarise(mean_life_exp = mean(lifeExp, na.rm = TRUE)) |> 
  arrange(desc(mean_life_exp))
#> # A tibble: 5 × 2
#>   continent mean_life_exp
#>   <fct>             <dbl>
#> 1 Oceania            80.7
#> 2 Europe             77.6
#> 3 Americas           73.6
#> 4 Asia               70.7
#> 5 Africa             54.8

Don’t forget to ungroup() your data (or set .groups = "drop") if you don’t want to do later computations by groups.

Mutating: Add new or overwrite existing variables:

gapminder |> mutate(pop_million = pop / 1e6)
#> # A tibble: 1,704 × 7
#>    country     continent  year lifeExp      pop gdpPercap pop_million
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>       <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.        8.43
#>  2 Afghanistan Asia       1957    30.3  9240934      821.        9.24
#>  3 Afghanistan Asia       1962    32.0 10267083      853.       10.3 
#>  4 Afghanistan Asia       1967    34.0 11537966      836.       11.5 
#>  5 Afghanistan Asia       1972    36.1 13079460      740.       13.1 
#>  6 Afghanistan Asia       1977    38.4 14880372      786.       14.9 
#>  7 Afghanistan Asia       1982    39.9 12881816      978.       12.9 
#>  8 Afghanistan Asia       1987    40.8 13867957      852.       13.9 
#>  9 Afghanistan Asia       1992    41.7 16317921      649.       16.3 
#> 10 Afghanistan Asia       1997    41.8 22227415      635.       22.2 
#> # ℹ 1,694 more rows

gapminder |> 
  mutate(
    pop_million = pop / 1e6,
    gdp = gdpPercap * pop
  )
#> # A tibble: 1,704 × 8
#>    country     continent  year lifeExp      pop gdpPercap pop_million        gdp
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>       <dbl>      <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.        8.43    6.57e 9
#>  2 Afghanistan Asia       1957    30.3  9240934      821.        9.24    7.59e 9
#>  3 Afghanistan Asia       1962    32.0 10267083      853.       10.3     8.76e 9
#>  4 Afghanistan Asia       1967    34.0 11537966      836.       11.5     9.65e 9
#>  5 Afghanistan Asia       1972    36.1 13079460      740.       13.1     9.68e 9
#>  6 Afghanistan Asia       1977    38.4 14880372      786.       14.9     1.17e10
#>  7 Afghanistan Asia       1982    39.9 12881816      978.       12.9     1.26e10
#>  8 Afghanistan Asia       1987    40.8 13867957      852.       13.9     1.18e10
#>  9 Afghanistan Asia       1992    41.7 16317921      649.       16.3     1.06e10
#> 10 Afghanistan Asia       1997    41.8 22227415      635.       22.2     1.41e10
#> # ℹ 1,694 more rows

Other useful stuff:

# count()

gapminder |> 
  filter(year == 2007) |> 
  count(continent) |> 
  arrange(desc(n))
#> # A tibble: 5 × 2
#>   continent     n
#>   <fct>     <int>
#> 1 Africa       52
#> 2 Asia         33
#> 3 Europe       30
#> 4 Americas     25
#> 5 Oceania       2
# n()

gapminder |> 
  filter(year == 2007) |> 
  group_by(continent) |> 
  summarise(n_obs = n(), mean_life_exp = mean(lifeExp, na.rm = TRUE))
#> # A tibble: 5 × 3
#>   continent n_obs mean_life_exp
#>   <fct>     <int>         <dbl>
#> 1 Africa       52          54.8
#> 2 Americas     25          73.6
#> 3 Asia         33          70.7
#> 4 Europe       30          77.6
#> 5 Oceania       2          80.7

For group_by(), if you continue your pipeline after summarise() or whatever you did by group, you may need to ungroup() (otherwise everything will be done by group). If you want to arrange() your grouped data, you may need to set .by_group = TRUE inside of arrange().

If there is one thing to remember from this course, it’s the dplyr-verbs, because they make data manipulation & exploration really easy. They are building blocks with which you can do most operations you will ever need for processing your data.

Exercise Time:

  1. What countries were Top 10 in life expectancy (lifeExp) in 1952? Which were the top 10 in 2007? What changes do you see (in terms of countries and in terms of life expectancy)?
  2. Compute overall by continent:
  1. Mean life expectancy
  2. mean GDP per capita
  1. Add a new variable to the gapminder data set: Total GDP in billion USD (we can assume that GDP = gdpPercap \(\times\) pop). Bonus: In 2007, which countries had the highest total GDP?
# Your code here...

  1. %>% is the original tidyverse-pipe introduced in the magrittr-package, while |> is the base R pipe (has only been around since R4.1). On the surface they don’t really differ, except for a different placeholder (a concept we will discuss later), but the base R pipe is a lot faster, and it looks cooler imo (also it looks like a big arrow to the right if you squint a bit, which conveys its purpose a lot better). Below the surface (for the R nerds), rlang::expr(f(x)) == rlang::expr(x |> f()) is TRUE, while rlang::expr(f(x)) == rlang::expr(x %>% f()) evaluates to FALSE, because the tidyverse-pipe engages in data masking (which makes it slower & produces the infamous traceback-vomit if you get errors in pipelines), while to the R interpreter x |> f() is literally the exact same as f(x), so there is no performance overhead. Consider rlang::expr(x |> f()) even evaluates to f(x). The %>% pipe is a bit more flexible in some edge cases, but I would consider most of them bad practices anyway.↩︎

  2. As the first argument↩︎

  3. Another way to write this would use pull() to pull a single vector out of a dataframe: gapminder |> filter(year == 2007) |> pull(lifeExp) |> mean(na.rm = TRUE)↩︎